
Upstream merge #289

Merged 27 commits into main from upstream-0.2.115-intermediate on May 5, 2023

Conversation

@lgruen lgruen commented May 5, 2023

Intermediate upstream merge between 0.2.115 and 0.2.116 to pick up hail-is#12929, which should fix our unit test failures (see https://centrepopgen.slack.com/archives/C03FZL2EF24/p1683251418792629?thread_ts=1683251237.683889&cid=C03FZL2EF24).

Please merge, don't squash.

danking and others added 27 commits April 26, 2023 15:49
…form project to track true current state (hail-is#12882)

I fixed some issues in `gcp`, but there are enough difficult-to-resolve
issues that I want to start a fresh terraform project to track reality,
as distinct from what we recommend a new user do.

The fresh attempt project is `gcp-broad`. I plan to slowly expand it
until it captures everything. From this point forward, for the subset of
infrastructure in `gcp-broad`, we should not make manual changes.
Instead, we should propose changes as PRs, review them, merge them, and
apply the terraform by hand once the changes are in `main`.
I should have checked quay.io for 1.12.0, which doesn't exist there
(https://quay.io/repository/skopeo/stable?tab=tags&tag=v1.12.0),
even though 1.12.0 was released two weeks ago:
https://github.com/containers/skopeo/releases
`credentials_host_file_path` is unused in `main`. Deleting that allowed
me to then delete `credentials_host_dirname`, after which
`CloudUserCredentials.file_name` and `CloudUserCredentials.secret_name`
were only used for `CloudUserCredentials.mount_path`, so I merged those.
`CloudUserCredentials.secret_data` and
`CloudUserCredentials.hail_env_name` were unused.
NB: this is a stacked PR. To see just these changes, see [this
commit](hail-is@ae51e0a)

---

[VPC Flow Logs](https://cloud.google.com/vpc/docs/flow-logs):

> VPC Flow Logs records a sample of network flows sent from and received by VM
> instances, including instances used as Google Kubernetes Engine nodes. These
> logs can be used for network monitoring, forensics, real-time security
> analysis, and expense optimization.

I found the collection process the most illuminating part of the
documentation. My summary of that process follows:

1. Packets are sampled on the network interface of a VM. Google claims
an average sampling rate of 1/30. The rate decreases if the VM is under
load and is not configurable by us.

2. Within an "aggregation interval", packets are aggregated into
"records" which are keyed (my term)
by source & destination. There are currently six choices for aggregation
interval: 5s, 30s, 1m,
   5m, 10m, and 15m.

3. Records are sampled. The sampling rate is a user-configured
floating-point number (precision unclear) between 0 and 1.

4. Metadata is optionally added to the records. The metadata captures
information about the source
and destination VM such as project id, VM name, zone, region, GKE pod,
GKE service, and geographic
information of external parties. The user may elect to receive all
metadata, no metadata, or a
   specific set of metadata fields.

5. The records are written to Google Cloud Logging.

The pricing of VPC Flow Logs is described at the [network pricing
page](https://cloud.google.com/vpc/network-pricing#network-telemetry).
Notice that, if logs are only sent to Cloud Logging (not to BigQuery,
Pub/Sub, or Cloud Storage):

> If you store your logs in Cloud Logging, logs generation charges are waived,
> and only Logging charges apply.

I believe that in this phrase, "logs generation charges" refers to *VPC
Flow Logs* generation charges. The Google Cloud Logging [pricing
page](https://cloud.google.com/stackdriver/pricing#google-clouds-operations-suite-pricing)
indicates that, after 50 GiB of free logs, the user is charged 0.50 USD
per GiB of logs. Storage is free for thirty days and 0.01 USD per GiB
for each additional day.
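
As a concrete helper, the pricing above amounts to something like the following
sketch. The function name and shape are mine; the constants (50 GiB free,
0.50 USD per GiB, 30 free retention days, 0.01 USD per GiB per additional day)
come from the pricing pages.

```python3
def monthly_logging_cost_usd(gib_ingested: float, retention_days: int = 30) -> float:
    """Approximate Cloud Logging cost for one month, per the pricing page above."""
    ingestion = max(0.0, gib_ingested - 50) * 0.50  # first 50 GiB free, then 0.50 USD/GiB
    extra_days = max(0, retention_days - 30)        # storage is free for thirty days
    storage = gib_ingested * 0.01 * extra_days      # then 0.01 USD per GiB per additional day
    return ingestion + storage

print(monthly_logging_cost_usd(169))  # ~59.5 USD, roughly the 10,000-VM estimate computed below
```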

We can calculate the cost of our logs as follows. Refer to the
[definition of the record
format](https://cloud.google.com/vpc/docs/flow-logs#record_format) for
details.

```python3
# Estimate the size of one flow-log record (see the record format linked above).
ip_string = len("123.123.123.123")
# The connection 5-tuple: two IP address strings plus three small integer fields
# (ports and protocol).
ip_connection = 4 + ip_string + ip_string + 4 + 4
date_time = len("1937-01-01T12:00:27.87+00:20")
record_bytes = sum((
    ip_connection,
    max(len('SRC'), len('DEST')),  # the reporter field
    8,                             # three 8-byte numeric fields
    8,
    8,
    date_time,                     # interval start time
    date_time,                     # interval end time
))
assert record_bytes == 126

hours_per_month = 24 * 30  # 24 hours a day for a 30-day month
seconds_per_hour = 60 * 60

# Parameters: 15-minute aggregation interval, 10,000 VMs, 50% record sampling,
# and an assumed 100 unique connections per VM per aggregation interval.
seconds_per_interval = 15 * 60
vms = 10000
sampling_rate = 0.5
connections_per_vm_per_aggregation_interval = 100

intervals_per_hour = seconds_per_hour / seconds_per_interval
records_per_hour = intervals_per_hour * vms * connections_per_vm_per_aggregation_interval * sampling_rate
bytes_per_hour = records_per_hour * record_bytes
bytes_per_month = bytes_per_hour * hours_per_month
GiB_per_month = bytes_per_month / 1024. / 1024 / 1024

# First 50 GiB per month are free; 0.50 USD per GiB thereafter.
USD_per_month = max(0, GiB_per_month - 50) * 0.5

print(GiB_per_month)
print(USD_per_month)
```

This works out to roughly 60 USD to run a 10,000 VM cluster 24 hours a day for
30 days.

I suspect our average VM count in a month is closer to 10, which is well
within the free tier (roughly 170 MiB). I might be wrong about the
connections per VM per aggregation interval, but this is straightforward to
monitor once we have the logs.

For a sense of the cost landscape, these are all free:

1. 1000 VMs.
2. 500 VMs, with a sampling rate of 1.
3. 200 VMs, with a sampling rate of 1, with an interval of 5 minutes.
4. 10 VMs, with a sampling rate of 1, with an interval of 30 seconds.
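
As a rough check on those scenarios, here is the same arithmetic from the block
above wrapped in a function (a sketch; the parameter names are mine):

```python3
def gib_per_month(vms, sampling_rate, interval_seconds,
                  connections_per_vm_per_interval=100, record_bytes=126,
                  hours_per_month=24 * 30):
    intervals_per_hour = 3600 / interval_seconds
    records_per_hour = intervals_per_hour * vms * connections_per_vm_per_interval * sampling_rate
    return records_per_hour * record_bytes * hours_per_month / 1024 ** 3

# All four scenarios land under the 50 GiB free allotment:
print(gib_per_month(1000, 0.5, 15 * 60))  # ~17 GiB
print(gib_per_month(500, 1.0, 15 * 60))   # ~17 GiB
print(gib_per_month(200, 1.0, 5 * 60))    # ~20 GiB
print(gib_per_month(10, 1.0, 30))         # ~10 GiB
```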

It's all linear, so if we need to halve the interval we can either
change the sampling rate, reassess our expected number of VM-hours, or
adjust the service fee accordingly.

We can also assess the landscape of fees necessary to cover costs
(ignoring the free 50 GiB):

1. 15 minute intervals, 0.5 sampling rate, 100 expected connections per
   VM per interval: 0.0000008 USD per core per hour.

2. 30 second intervals, 1.0 sampling rate, 100 expected connections per
   VM per interval: 0.00005 USD per core per hour.

3. 5 second intervals, 1.0 sampling rate, 100 expected connections per
   VM per interval: 0.0003 USD per core per hour.

4. 5 second intervals, 1.0 sampling rate, 1000 expected connections per
   VM per interval (1000 unique connections per 5-second interval honestly
   seems to me quite remarkable performance): 0.003 USD per core per hour.

```python3
# Reuses bytes_per_hour and vms from the block above; assumes 16 cores per VM
# and 0.50 USD per GiB of logs.
USD_per_core_per_hour = bytes_per_hour / vms / 1024. / 1024 / 1024 * 0.5 / 16

print(USD_per_core_per_hour)
```

---

# Conclusion

I think we're safe to enable this with the parameters in this PR
(15-minute intervals, 50% sampling). We can assess unknown parameters, like
connections per VM, and get comfortable looking at these logs.

Security constraints or observability demands may push us towards
desiring more logs. If that
occurs, we can assess the need for a new fee. Regardless, this fee
appears to be small relative to
the current cost of preemptible cores.
…l-is#12944)

We always ensure that `cleanup` is run, but to guarantee that
`post_job_complete` also runs we need to get through any computation in
`cleanup` and `mark_complete` that could potentially hang. So we add timeouts
to any yield points in those functions and catch exceptions broadly, so that
we always continue through the cleanup/`mark_complete` process regardless of
failure.
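
The shape of that pattern, as a minimal sketch (the step names and the
five-minute timeout here are illustrative, not the actual Batch code):

```python3
import asyncio
import logging

log = logging.getLogger(__name__)

async def run_cleanup_steps(steps):
    # Bound every await so a hung step cannot block the rest of cleanup, and catch
    # broadly so one failing step does not prevent the remaining steps from running.
    for step in steps:
        try:
            await asyncio.wait_for(step(), timeout=300)
        except Exception:
            log.exception(f'cleanup step {step.__name__} failed; continuing')
```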
There was an extra token-matching check in the WHERE clause for the jobs
billing table, which is the only billing table not parameterized with a
token.
CHANGELOG: `hl.Struct` now has a correct and useful implementation of
pprint.

For structs with keys that were not identifiers, we produced incorrect
`repr` output. For `pprint`, we just `pprint`'ed a dictionary (so you
cannot even tell that the object was an `hl.Struct`). This PR:

1. Fixes `hl.Struct.__str__` to use the kwargs or dictionary
representation based on whether the keys are Python identifiers
(sketched below).
2. Teaches `StructPrettyPrinter` to first try to `repr` the struct (this
is what the default pretty printer does).
3. Teaches `StructPrettyPrinter` to properly pretty print a struct as an
`hl.Struct`, preferring the kwarg representation when appropriate.
4. Teaches `_same` to use pretty printing when showing differing
records.
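
The identifier-based choice in point 1 amounts to roughly the following
(a self-contained sketch of the logic, not the actual Hail implementation):

```python3
def struct_repr(fields: dict) -> str:
    # Keys that are valid Python identifiers can be rendered as keyword arguments;
    # any other key forces the dictionary form.
    if all(isinstance(k, str) and k.isidentifier() for k in fields):
        args = ', '.join(f'{k}={v!r}' for k, v in fields.items())
        return f'Struct({args})'
    return f'Struct(**{fields!r})'

print(struct_repr({'x': 5, 'y': 'foo'}))   # Struct(x=5, y='foo')
print(struct_repr({'hello world': 5}))     # Struct(**{'hello world': 5})
```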
As a mitigation, this turns off cloudfuse until we understand why the
unmount is not working properly.
Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1
programmatic reconfiguration system, you will break log4j2 for you and
everyone else. The only way to recover from such a breakage is to use
the log4j2 programmatic reconfiguration system.

Changes in this PR:

1. Include JVM output in error logs when the JVM crashes. This should
help debugging of JVM crashing in production until the JVM logs are
shown on a per-worker page.

2. JVMEntryway is now a real gradle project. I need to compile against
log4j, and I didn't want to do that by hand with `javac`. Ignore
gradlew, gradlew.bat, and gradle/wrapper; they're programmatically
generated by gradle.

3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into
the QoB job log. I also log exceptions from the main thread or the
cancel thread into the job log. We also flush the logs after the main
thread completes, the cancel thread completes, and when the try-catch
exits. This should ensure that regardless of what goes wrong (even if
both threads fail to start) we at least see the arguments that the
JVMEntryway received.

4. Use log4j2 programmatic reconfiguration after every job. This
restores log4j2 to well enough working order that, *if you do not try to
reconfigure it using log4j1 programmatic configuration*, logs will work.
All old versions of Hail use log4j1 programmatic configuration. As a
result, **all old versions of Hail will still have no logs**. However,
new versions of Hail will log correctly even if an old version of Hail
used the JVM before it.

5. `QoBAppender`. This is how we always should have done logging. A
custom appender which we can flush and then redirect to a new file at
our whim. I followed the log4j2 best practices for creating a new
appender. All these annotations, factory methods, and managers are The
Right Way, for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also
eliminate the log4j2 reconfiguration. New versions of Hail work fine
without any runtime log configuration (thanks to `QoBAppender`).

I would like to eliminate reconfiguration because log4j2 reconfiguration
leaves around orphaned appenders and appender managers. Maybe I'm
implementing the Appender or Appender Manager interfaces wrong, but I've
read over that code a bunch of times and I cannot sort out what I am
missing.
See hail-is#12958.

---------

Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
…es to req generation (hail-is#12929)

CHANGELOG: Hail no longer officially supports Python 3.7.

Combines hail-is#12927 and
hail-is#12908.

Many changes. All seem to be necessary together.

---

I fixed any new mypy errors and deprecation warnings. I also cleaned up
plots.py (which isn't checked by mypy in CI) because I was in there and it
was a mess. I unified all our pip-tools versions. We require pandas 2 now,
which requires Bokeh 3.x.x. I fixed the pinned-requirements.txt dependencies
so they reflect the versions that actually have to work together at runtime.
I upgraded Sphinx; the method names lose their fixed-width styling, but I
think it looks fine. A version policy was added. Updating everything means
Jinja2 can jump to the latest version too. NumPy deprecated some of its type
aliases; using the trailing-underscore (`_`) names gets rid of the warning.
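
For example, with the deprecated NumPy aliases (a minimal illustration;
`np.bool_` is the supported trailing-underscore spelling):

```python3
import numpy as np

# Deprecated alias (warns, and removed in newer NumPy releases):
#   mask = np.zeros(3, dtype=np.bool)
# Supported spelling with the trailing underscore:
mask = np.zeros(3, dtype=np.bool_)
```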

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The old bucket did not use uniform access control and also was
multi-regional (us). I created a new bucket using the random suffix
ger0g which has uniform access control. I also switched the location to
us-central1 (not pictured here because that is a variable).

I copied all the JARs from `gs://hail-query/jars` to
`gs://hail-query-ger0g/jars` using a GCE VM.

Again, global-config is not present in our terraform, so I'll have to
manually edit that to reflect this new location:
`gs://hail-query-ger0g`.

The deployment process is:

1. Edit global-config to reflect new bucket.
2. Delete batch and batch-driver pods.
3. Delete old workers.

The rollback process (if necessary) is the same.

Since this requires wiping the workers, I'll wait for a time when no one
is on the cluster to do it.

Any users using explicit JAR URLs will need to switch to
`gs://hail-query-ger0g/...`.
…12960)

Applies the most restrictive bind and event propagation settings to job
container mounts. While user jobs do not have the capability to create
mount points, overlapping mount points in the container config can
inadvertently trigger mount propagation back to the host, which we never
want.
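
In OCI terms, the most restrictive settings look something like this (an
illustrative mount entry, not the exact Batch container config; `rprivate`
prevents mount events from propagating back to the host):

```python3
# Hypothetical job mount entry in OCI-config style, shown as a Python dict.
job_mount = {
    'destination': '/io',
    'type': 'none',
    'source': '/host/batch/io',  # illustrative host path
    'options': [
        'rbind',      # bind mount
        'rprivate',   # most restrictive propagation: events never reach the host
        'nosuid',
        'nodev',
    ],
}
```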
I updated terraform, but:

1. GCP Terraform state is still local on my laptop.

2. GCP Terraform appears to not configure global-config. As such, I
cannot thread the name of the bucket through to the tests the way we do
with TEST_STORAGE_URI. For now, I've hardcoded the name (which is what
we were doing previously). When we eventually get to testing recreation
of GCP in a new project we'll have to address the global config then.
hail-is@1940547
should have incremented the worker version. New QoB logging is
incompatible with old workers.
This PR just adds a test to make sure submounts don't cause deletion.
Because the Python version in hail-ubuntu changed from 3.7 to 3.8, both
of the `hail-pip-installed` dockerfiles were running 3.8. I changed the
old 3.7 dockerfile to try to use 3.9.
…-is#12979)

Skipping these for the time being because they currently only pass for
16-core worker VMs and not for 8-core worker VMs.
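
Something along these lines (the test name and reason string are illustrative):

```python3
import pytest

@pytest.mark.skip(reason='currently only passes on 16-core worker VMs; see hail-is#12979')
def test_example_memory_bound():
    ...
```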
@lgruen lgruen requested a review from illusional May 5, 2023 02:00
@lgruen lgruen merged commit fc212f5 into main May 5, 2023
@lgruen lgruen deleted the upstream-0.2.115-intermediate branch May 5, 2023 02:01
@lgruen lgruen mentioned this pull request May 5, 2023